
feat: add deterministic seed_random management command for synthetic …#3041

Open
abdihakim92x1 wants to merge 40 commits into main from task/CDD-3154-seed-random-data

Conversation

@abdihakim92x1
Contributor

…dev data

Description

This PR includes the following:

  • Introduces python manage.py seed_random
  • Supports dataset selection: cms | metrics | both
  • Supports scale options: small | medium | large
  • Deterministic seeding via --seed
  • Optional --truncate-first for safe reset of metrics data
  • Reuses build_cms_site for CMS baseline
  • Bulk inserts CoreTimeSeries and APITimeSeries for performance
  • Designed for personal dev environment seeding
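
The deterministic behavior promised by --seed can be illustrated with a simplified sketch (this is not the command's actual internals; the function and field names here are hypothetical):

```python
import random

def generate_rows(seed: int, n_rows: int) -> list[dict]:
    """Generate reproducible synthetic metric values from a fixed seed.

    Illustrates the idea behind --seed: the same seed always yields the
    same rows, so two dev environments seeded identically are comparable.
    """
    rng = random.Random(seed)  # isolated RNG instance; avoids mutating global random state
    return [
        {"day": day, "metric_value": rng.randint(0, 1000)}
        for day in range(n_rows)
    ]

# The same seed produces identical output on every run.
assert generate_rows(seed=42, n_rows=5) == generate_rows(seed=42, n_rows=5)
```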

Partly addresses #CDD-3154


Type of change

Please select the options that are relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Tech debt item (this is focused solely on addressing any relevant technical debt)

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests at the right levels to prove my change is effective
  • I have added screenshots or screen grabs where appropriate
  • I have added docstrings in the correct style (google)

@abdihakim92x1 abdihakim92x1 self-assigned this Mar 4, 2026
@abdihakim92x1 abdihakim92x1 requested a review from a team as a code owner March 4, 2026 00:31

@jeanpierrefouche-ukhsa jeanpierrefouche-ukhsa left a comment


Good code! Please see my comments.

@jrdh jrdh force-pushed the task/CDD-3154-seed-random-data branch from 50b07fa to a70c824 Compare March 6, 2026 09:18
Contributor

@jrdh jrdh left a comment


There may be some other things I could pick up on in here but I'm going to pause going through it because one of the ACs is unmet and is a bigger problem - specifically "The data is generated and uploaded to the s3 bucket in the target dev env, not directly inserted into the database". Instead of inserting the metric data into the database, this should be generating JSON files and placing them in the ingestion bucket for the target environment. This allows us to test ingestion and validation, as well as avoids duplication of logic related to how metric data is added to the database. Appreciate this means a change in approach so do shout if there was a reason you went down this path instead of following the AC.

Stratum.objects.all().delete()

@classmethod
def _seed_time_series_rows(
Contributor


Can we add a docstring please?

@jrdh jrdh force-pushed the task/CDD-3154-seed-random-data branch 2 times, most recently from 4d2477c to c99d952 Compare March 9, 2026 13:54
@abdihakim92x1
Contributor Author

abdihakim92x1 commented Mar 10, 2026

There may be some other things I could pick up on in here but I'm going to pause going through it because one of the ACs is unmet and is a bigger problem - specifically "The data is generated and uploaded to the s3 bucket in the target dev env, not directly inserted into the database". Instead of inserting the metric data into the database, this should be generating JSON files and placing them in the ingestion bucket for the target environment. This allows us to test ingestion and validation, as well as avoids duplication of logic related to how metric data is added to the database. Appreciate this means a change in approach so do shout if there was a reason you went down this path instead of following the AC.

Thanks, agree this AC calls for generating payloads into the ingestion S3 bucket rather than writing directly to metrics tables.
The current implementation intentionally focused on quickly unblocking realistic local/dev data setup and validating model-level seeding behavior first, which is why it writes directly to DB.
I agree this does not satisfy the ingestion-path AC as written. I propose a follow-up change to pivot this command to produce JSON files and upload to the target env ingestion bucket, so we exercise ingestion + validation end-to-end and avoid duplicating DB write logic.
If helpful, I can raise that as a separate PR to keep scope/review risk controlled.
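
A rough sketch of what that pivot could look like, generating deterministic ingestion-style JSON payload files locally; the payload shape, file naming, and the S3 upload step (mentioned only in the docstring) are all placeholders, not the real ingestion schema:

```python
import json
import random
from pathlib import Path

def write_seed_payloads(seed: int, out_dir: Path, n_files: int = 3) -> list[Path]:
    """Write deterministic JSON payload files that could later be uploaded
    to the target environment's ingestion bucket (e.g. via boto3) instead
    of inserting rows directly into the database.

    The payload structure below is illustrative only, not the real
    ingestion schema.
    """
    rng = random.Random(seed)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_files):
        payload = {
            "topic": f"synthetic_topic_{i}",
            "time_series": [rng.randint(0, 100) for _ in range(5)],
        }
        path = out_dir / f"seed_payload_{i}.json"
        path.write_text(json.dumps(payload))
        paths.append(path)
    return paths
```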

@abdihakim92x1 abdihakim92x1 requested a review from jrdh March 10, 2026 08:59
Contributor

@jrdh jrdh left a comment


As well as the few comments throughout, it'd also be good to add a system test in the same vein as the build_cms_site system tests - i.e. use the actual command with the database and confirm that we get what we expect out of it using the API / querying the database.

Otherwise looks decent so far, though I haven't had a chance to test it in my aws env yet. I'll reply to the comment about the AC in a mo.

truncate_first: bool,
progress_callback: Callable[[str], None] | None = None,
) -> dict[str, int]:
"""Seed supporting metric models and time series rows for the selected scale."""
Contributor


Can we add param docs please?

@classmethod
def _seed_theme_hierarchy(cls) -> tuple[list[Theme], list[SubTheme], list[Topic]]:
theme_names, sub_theme_rows, topic_rows = cls._build_theme_hierarchy_records()
themes = cls._bulk_create(Theme, [Theme(name=name) for name in theme_names])
Contributor


When testing locally, if I run uhd bootstrap all and then run this seeding command without any truncation (e.g. python manage.py seed_random --dataset metrics), I get integrity errors. Try seed 1773741316 and you should get the same. Essentially, the hierarchy created here needs to either take the existing hierarchy into account when it's generated, or the bulk create of the records in the db needs to tolerate and manage the integrity errors.

Here's the stack trace:

((.venv) ) ➜  data-dashboard-api git:(task/CDD-3154-seed-random-data) python manage.py seed_random --dataset metrics                 
Seed used: 1773741316
Seeding metrics dataset...
Preparing metric taxonomy and geography records...
Traceback (most recent call last):
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/sqlite3/base.py", line 360, in execute
    return super().execute(query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.IntegrityError: UNIQUE constraint failed: data_theme.name

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/josh/work/data-dashboard-api/manage.py", line 23, in <module>
    main()
  File "/home/josh/work/data-dashboard-api/manage.py", line 19, in main
    execute_from_command_line(sys.argv)
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/base.py", line 420, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/base.py", line 464, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 96, in handle
    counts = self._seed_metrics_data(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 133, in _seed_metrics_data
    themes, sub_themes, topics = cls._seed_theme_hierarchy()
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 259, in _seed_theme_hierarchy
    themes = cls._bulk_create(Theme, [Theme(name=name) for name in theme_names])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 376, in _bulk_create
    return model.objects.bulk_create(list(records))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 825, in bulk_create
    returned_columns = self._batched_insert(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1901, in _batched_insert
    self._insert(
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1873, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 1882, in execute_sql
    cursor.execute(sql, params)
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 122, in execute
    return super().execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 79, in execute
    return self._execute_with_wrappers(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 92, in _execute_with_wrappers
    return executor(sql, params, many, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 100, in _execute
    with self.db.wrap_database_errors:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/sqlite3/base.py", line 360, in execute
    return super().execute(query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.IntegrityError: UNIQUE constraint failed: data_theme.name
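
One way to make the bulk insert tolerant of existing rows is Django's `bulk_create(records, ignore_conflicts=True)`, which on SQLite compiles down to `INSERT OR IGNORE`. A standalone sqlite3 sketch of that behavior (the table and column names mirror the constraint in the trace above, but the data is made up):

```python
import sqlite3

# Mimic the data_theme table's UNIQUE constraint on name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_theme (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("INSERT INTO data_theme (name) VALUES ('infectious_disease')")

# A plain INSERT of a duplicate name raises IntegrityError, as in the
# stack trace above; INSERT OR IGNORE skips conflicting rows instead,
# which is what bulk_create(..., ignore_conflicts=True) emits on SQLite.
conn.executemany(
    "INSERT OR IGNORE INTO data_theme (name) VALUES (?)",
    [("infectious_disease",), ("extreme_event",)],
)

names = [row[0] for row in conn.execute("SELECT name FROM data_theme ORDER BY name")]
# Only the new row was added; the duplicate was silently skipped.
```

One caveat worth noting: `bulk_create` with `ignore_conflicts=True` does not set primary keys on the returned objects, so if later seeding steps rely on those instances, re-querying the rows (or using get_or_create per record) may be the safer route.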

geography=geography,
stratum=stratum,
age=age,
sex=None,
Contributor


Could we randomise this between the currently allowed values (male, female, all)?
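
A minimal sketch of the suggested randomisation, assuming the command's seeded RNG is threaded through so the choice stays deterministic per --seed. The string values below are taken from the comment; the real column may store different codes:

```python
import random

# Placeholder labels from the review comment; the actual stored values
# in the sex column may differ (e.g. single-letter codes).
ALLOWED_SEX_VALUES = ("male", "female", "all")

def pick_sex(rng: random.Random) -> str:
    """Pick one of the currently allowed sex values using the seeded RNG,
    so the same --seed always reproduces the same sequence of choices."""
    return rng.choice(ALLOWED_SEX_VALUES)

rng = random.Random(1234)
values = {pick_sex(rng) for _ in range(50)}
# All picks come from the allowed set, and rerunning with the same seed
# reproduces the exact same sequence.
```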

"small": {"geographies": 5, "metrics": 10, "days": 30},
"medium": {"geographies": 20, "metrics": 50, "days": 180},
"large": {"geographies": 100, "metrics": 200, "days": 365},
}
Contributor


Would be nice to add a comment here to give an indication of the scale of these. From testing myself, small gets you about 1500 time series records, medium gets you 180000, and large gets you 7.3 million.
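
For context, those counts line up with geographies × metrics × days, so the comment could be derived straight from the config. A sketch (the dict values are copied from the snippet; the `SCALE_CONFIG` and `estimated_records` names are illustrative, and real counts are approximate since the command may add other dimensions):

```python
# Scale presets annotated with the approximate number of time series
# records each produces (geographies x metrics x days).
SCALE_CONFIG = {
    "small": {"geographies": 5, "metrics": 10, "days": 30},      # ~1,500 records
    "medium": {"geographies": 20, "metrics": 50, "days": 180},   # ~180,000 records
    "large": {"geographies": 100, "metrics": 200, "days": 365},  # ~7,300,000 records
}

def estimated_records(scale: str) -> int:
    """Rough record count for a scale preset."""
    cfg = SCALE_CONFIG[scale]
    return cfg["geographies"] * cfg["metrics"] * cfg["days"]
```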

@jrdh
Contributor

jrdh commented Mar 19, 2026

There may be some other things I could pick up on in here but I'm going to pause going through it because one of the ACs is unmet and is a bigger problem - specifically "The data is generated and uploaded to the s3 bucket in the target dev env, not directly inserted into the database". Instead of inserting the metric data into the database, this should be generating JSON files and placing them in the ingestion bucket for the target environment. This allows us to test ingestion and validation, as well as avoids duplication of logic related to how metric data is added to the database. Appreciate this means a change in approach so do shout if there was a reason you went down this path instead of following the AC.

Thanks, agree this AC calls for generating payloads into the ingestion S3 bucket rather than writing directly to metrics tables. The current implementation intentionally focused on quickly unblocking realistic local/dev data setup and validating model-level seeding behavior first, which is why it writes directly to DB. I agree this does not satisfy the ingestion-path AC as written. I propose a follow-up change to pivot this command to produce JSON files and upload to the target env ingestion bucket, so we exercise ingestion + validation end-to-end and avoid duplicating DB write logic. If helpful, I can raise that as a separate PR to keep scope/review risk controlled.

I'd rather the original brief was fulfilled than some work merged and another ticket created to do what the original work should have done. Given none of this functionality exists yet and there's no urgent need for this that I'm aware of, the only downside I can see is that there will be chunks of code in this PR we no longer need. From your investigations to get to this point, did you find anything that would make using the S3 bucket ingestion pathway problematic to implement?

@jrdh jrdh force-pushed the task/CDD-3154-seed-random-data branch from b274ea0 to 3c728b1 Compare March 23, 2026 08:00
@abdihakim92x1 abdihakim92x1 force-pushed the task/CDD-3154-seed-random-data branch from 6a4ee5a to 75cc893 Compare March 27, 2026 12:39
@sonarqubecloud

sonarqubecloud bot commented Apr 1, 2026

Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud

